-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53996][SQL] Improve InferFiltersFromConstraints to infer filters from complex join expressions #52699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
290491a to
6d72c4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you're trying to do is to propagate literals in InferFilterFromConstraints.
The crux of the problem is:
- InferFilterFromConstraints DOES consider join conditions and expressions across multiple operators, but it does not consider literals.
- ConstantPropagation considers literals, but does not consider operators outside a single Filter node.
Can we just reuse logic in ConstantPropagation (which is more robust, and historically tested) in InferFilterFromConstraints.getAllConstraints?
| // Avoid inferring tautologies like 1 = 1 | ||
| val isTautology = replaced match { | ||
| case EqualTo(left: Expression, right: Expression) if left.foldable && right.foldable => | ||
| left.eval() == right.eval() | ||
| case _ => false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little complicated and potentially non-performant given that we have to do expression evaluation in the driver during compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion above, I'll try to reuse existing code as much as possible.
Regarding the performance. Expression evaluation already happens in the driver in ConstantFolding rule.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Lines 49 to 64 in 7c5a9a3
| object ConstantFolding extends Rule[LogicalPlan] { | |
| // This tag is for avoid repeatedly evaluating expression inside conditional expression | |
| // which has already failed to evaluate before. | |
| private[sql] val FAILED_TO_EVALUATE = TreeNodeTag[Unit]("FAILED_TO_EVALUATE") | |
| private def hasNoSideEffect(e: Expression): Boolean = e match { | |
| case _: Attribute => true | |
| case _: Literal => true | |
| case c: Cast if !conf.ansiEnabled => hasNoSideEffect(c.child) | |
| case _: NoThrow if e.deterministic => e.children.forall(hasNoSideEffect) | |
| case _ => false | |
| } | |
| private def tryFold(expr: Expression, isConditionalBranch: Boolean): Expression = { | |
| try { | |
| Literal.create(expr.freshCopyIfContainsStatefulExpression().eval(EmptyRow), expr.dataType) |
…rs from complex join expressions
6d72c4c to
2b09213
Compare
|
@andylam-db Thanks for you suggestion, I simplified the code to reuse logic from |
What changes were proposed in this pull request?
This PR improves the Spark SQL optimizer’s
InferFiltersFromConstraintsrule to infer filter conditions from join constraints that involve complex expressions, not just simple attribute equalities.Currently, the optimizer can only infer additional constraints when the join condition is a simple equality (e.g.,
a = b). For more complex expressions, such as arithmetic operations, it does not infer the corresponding filter.Example (currently works as expected):
In this case, the optimizer correctly infers the additional constraint
t1.a = 1.Example (now handled by this PR):
Here, it is clear that
t1.a = 3(sincet2.b = 1andt1.a = t2.b + 2), but previously the optimizer did not infer this constraint. With this change, the optimizer can now deduce and push downt1.a = 3.How was this patch tested?
You can reproduce and verify the improvement with the following:
Before this change, the physical plan does not include the inferred filter:
With this PR, the optimizer should infer and push down
t2.b = 3as an additional filter.Why are the changes needed?
Without this enhancement, the optimizer cannot push down filters or optimize query execution plans for queries with complex join conditions, which can lead to suboptimal join performance.